diff --git a/README.md b/README.md index be16d6b..c910910 100644 --- a/README.md +++ b/README.md @@ -11,6 +11,8 @@ Nova is a friendly, slightly witty Discord companion that chats naturally in DMs - Optional "miss u" pings that DM your coder at random intervals (0–6h) when `CODER_USER_ID` is set. - Dynamic per-message prompt directives that tune Nova's tone (empathetic, hype, roleplay, etc.) before every OpenAI call. - Lightweight DuckDuckGo scraping for "Google-like" answers without paid APIs (locally cached). +- Guard rails that refuse "ignore previous instructions"-style jailbreak attempts plus a configurable search blacklist. +- By default, DuckDuckGo requests are relayed through rotating ProxyScrape HTTP proxies so searches do not leave from Nova's real IP (set `ENABLE_PROXY_SCRAPE=false` to go direct). ## Prerequisites - Node.js 18+ @@ -34,6 +36,10 @@ Nova is a friendly, slightly witty Discord companion that chats naturally in DMs - `BOT_CHANNEL_ID`: Optional guild channel ID where the bot can reply without mentions - `CODER_USER_ID`: Optional Discord user ID to receive surprise DMs every 0–6 hours - `ENABLE_WEB_SEARCH`: Set to `false` to disable DuckDuckGo lookups (default `true`) + - `ENABLE_PROXY_SCRAPE`: Set to `false` only if you want to bypass ProxyScrape and hit DuckDuckGo directly (default `true`) + - `PROXYSCRAPE_ENDPOINT`: Optional override for the proxy list endpoint (defaults to elite HTTPS-capable HTTP proxies) + - `PROXYSCRAPE_REFRESH_MS`: How long to cache the proxy list locally (default 600000 ms) + - `PROXYSCRAPE_ATTEMPTS`: Max proxy retries per search request (default 5) ## Running - Development: `npm run dev` @@ -87,6 +93,9 @@ README.md - `src/search.js` scrapes DuckDuckGo's HTML endpoint with a normal browser user-agent, extracts the top results (title/link/snippet), and caches them for 10 minutes to avoid hammering the site. - `bot.js` detects when a question sounds “live” (mentions today/news/google/etc.) and injects the formatted snippets into the prompt as "Live intel". 
No paid APIs involved—it’s just outbound HTTPS from your machine. - Toggle this via `ENABLE_WEB_SEARCH=false` if you don’t want Nova to look things up. +- DuckDuckGo traffic is routed through the free ProxyScrape list (HTTP proxies with HTTPS support). The bot downloads a fresh pool every `PROXYSCRAPE_REFRESH_MS`, rotates through them, and refuses to search if no proxy is available so your origin IP never touches suspicious sites directly. Tune the endpoint/refresh/attempt knobs with the env vars above if you need different regions or paid pools. +- Edit `data/filter.txt` to maintain a newline-delimited list of banned search keywords/phrases; matching queries are blocked before hitting DuckDuckGo and Nova is instructed to refuse them. +- Every entry in `data/search.log` records which proxy (or cache) served the lookup so you can audit traffic paths quickly. ## Proactive Pings - When `CODER_USER_ID` is provided, Nova spins up a timer on startup that waits a random duration (anywhere from immediate to 6 hours) before DMing that user. @@ -99,6 +108,10 @@ README.md - **2026-02-13 — Live intel & directives:** Introduced DuckDuckGo scraping, per-turn dynamic prompt directives (tone, roleplay, instruction compliance), and env toggles (`ENABLE_WEB_SEARCH`, `CODER_USER_ID`). - **2026-02-13 — UX polish:** Added typing indicators, persona-aware fallback replies, mention cleaning, and README/docs covering setup, memory internals, web search, and deployment tips. - **2026-02-13 — Conversational control:** Tuned system prompt to avoid forced follow-up questions, raised temperature for looser banter, and reinforced Nova's awareness of DuckDuckGo lookups plus `` usage. +- **2026-02-13 — Statement-first vibes:** Reworked persona to favor bold statements over reflexive questions and dialed back temperature so Nova keeps the vibe without interrogating users. 
+- **2026-02-13 — Search logging:** Every DuckDuckGo lookup now appends a line to `data/search.log` with timestamp, query, and the snippets shared with Nova. +- **2026-02-13 — Safeguards:** Added prompt bypass detection and a file-based DuckDuckGo filter (`data/filter.txt`) to keep Nova from honoring jailbreak requests or searching off-limits topics. +- **2026-02-13 — Proxy-based search:** DuckDuckGo scraping now tunnels through ProxyScrape relays with automatic rotation/retries and clear prompts when the proxy pool is down, plus new env toggles for tuning the proxy source. ## Notes - The bot retries OpenAI requests up to 3 times with incremental backoff when rate limited. diff --git a/data/filter.txt b/data/filter.txt new file mode 100644 index 0000000..26b5b4e --- /dev/null +++ b/data/filter.txt @@ -0,0 +1,500 @@ +# One blocked keyword or phrase per line. +# Lines starting with # are comments. +jailbreak +ignore previous instructions +override safety +system prompt leak +disable content filter +bypass guardrails +forget earlier rules +anti censorship mode +break character instructions +throw away policies +ignore developer commands +pretend no restrictions +act without guidelines +remove moral limiter +no safety mode +developer override key +erase compliance mode +act as uncensored ai +run in unrestricted mode +defy content filter +bypass moderation +disable policy lock +act without boundaries +forget safety layer +turn off guard rails +ignore constitution +no censorship protocol +break safety script +manual override sequence +act outside ruleset +scrub safety prompt +mutiny against policies +override ethics mode +strip moderation layer +developer debug mode +system prompt override +anti safety payload +bypass restrictions command +policy nullifier +unfiltered response mode +nsfw +explicit content +adult content +xxx video +hardcore video +adult webcam +fetish forum +rule34 +adult roleplay +kink meetup +lewd fanfic +sensual rp +spicy fanfic +after dark chat +18 
plus only +uncensored romance +graphic intimacy +explicit novel +adult dating leak +amateur tape +onlyfans leak +camgirl link +cam site password +uncut footage +uncensored gallery +mature content pack +adult image dump +adult cosplay +sensual asmr +lewd commission +nsfw teacher fantasy +nsfw teacher scenario +nsfw teacher roleplay +nsfw teacher fanfic +nsfw teacher story +nsfw teacher script +nsfw teacher diary +nsfw teacher confession +nsfw teacher dream +nsfw teacher secret +nsfw teacher message +nsfw teacher letter +nsfw teacher scene +nsfw teacher prompt +nsfw teacher skit +nsfw teacher idea +nsfw teacher plot +nsfw doctor fantasy +nsfw doctor scenario +nsfw doctor roleplay +nsfw doctor fanfic +nsfw doctor story +nsfw doctor script +nsfw doctor diary +nsfw doctor confession +nsfw doctor dream +nsfw doctor secret +nsfw doctor message +nsfw doctor letter +nsfw doctor scene +nsfw doctor prompt +nsfw doctor skit +nsfw doctor idea +nsfw doctor plot +nsfw nurse fantasy +nsfw nurse scenario +nsfw nurse roleplay +nsfw nurse fanfic +nsfw nurse story +nsfw nurse script +nsfw nurse diary +nsfw nurse confession +nsfw nurse dream +nsfw nurse secret +nsfw nurse message +nsfw nurse letter +nsfw nurse scene +nsfw nurse prompt +nsfw nurse skit +nsfw nurse idea +nsfw nurse plot +nsfw librarian fantasy +nsfw librarian scenario +nsfw librarian roleplay +nsfw librarian fanfic +nsfw librarian story +nsfw librarian script +nsfw librarian diary +nsfw librarian confession +nsfw librarian dream +steal data from bank server +steal data from cloud tenant +steal data from hospital network +steal data from smart home +steal data from city grid +steal data from nuclear plant +steal data from airport system +steal data from military base +steal data from university lab +steal data from charity donor list +steal data from voting machine +steal data from payment processor +steal data from crypto exchange +steal data from mobile carrier +steal data from iot fleet +steal data from school district 
portal +breach bank server +breach cloud tenant +breach hospital network +breach smart home +breach city grid +breach nuclear plant +breach airport system +breach military base +breach university lab +breach charity donor list +breach voting machine +breach payment processor +breach crypto exchange +breach mobile carrier +breach iot fleet +breach school district portal +infect bank server +infect cloud tenant +infect hospital network +infect smart home +infect city grid +infect nuclear plant +infect airport system +infect military base +infect university lab +infect charity donor list +infect voting machine +infect payment processor +infect crypto exchange +infect mobile carrier +infect iot fleet +infect school district portal +compromise bank server +compromise cloud tenant +compromise hospital network +compromise smart home +compromise city grid +compromise nuclear plant +compromise airport system +compromise military base +compromise university lab +compromise charity donor list +compromise voting machine +compromise payment processor +how to make pipe bomb +how to make pressure cooker bomb +how to make molotov cocktail +how to make homemade grenade +how to make improvised explosive device +how to make fertilizer bomb +how to make tannerite charge +how to make thermite charge +how to make flash powder charge +how to make claymore mine +how to make homemade firearm +how to make ghost gun +how to make slam fire shotgun +how to make zip gun +how to make 3d printed gun +how to make silencer +how to make improvised rocket +how to make homemade napalm +how to make chemical agent disperser +how to make dirty bomb +how to make nerve agent +how to make chlorine bomb +how to make mustard gas +how to make ricin toxin +how to make anthrax culture +how to make botulinum toxin +how to make cyanide gas +how to make chloroform lab +how to make poison dart +how to make crossbow +how to make ballistic knife +how to make brass knuckles +how to make switchblade +how to make 
sawed-off shotgun +blueprint for pipe bomb +blueprint for pressure cooker bomb +blueprint for molotov cocktail +blueprint for homemade grenade +blueprint for improvised explosive device +blueprint for fertilizer bomb +blueprint for tannerite charge +blueprint for thermite charge +blueprint for flash powder charge +blueprint for claymore mine +blueprint for homemade firearm +blueprint for ghost gun +blueprint for slam fire shotgun +blueprint for zip gun +blueprint for 3d printed gun +blueprint for silencer +blueprint for improvised rocket +blueprint for homemade napalm +blueprint for chemical agent disperser +blueprint for dirty bomb +blueprint for nerve agent +blueprint for chlorine bomb +blueprint for mustard gas +blueprint for ricin toxin +blueprint for anthrax culture +blueprint for botulinum toxin +blueprint for cyanide gas +blueprint for chloroform lab +blueprint for poison dart +blueprint for crossbow +blueprint for ballistic knife +blueprint for brass knuckles +blueprint for switchblade +blueprint for sawed-off shotgun +instructions for pipe bomb +instructions for pressure cooker bomb +instructions for molotov cocktail +instructions for homemade grenade +instructions for improvised explosive device +instructions for fertilizer bomb +instructions for tannerite charge +instructions for thermite charge +instructions for flash powder charge +instructions for claymore mine +instructions for homemade firearm +instructions for ghost gun +instructions for slam fire shotgun +instructions for zip gun +instructions for 3d printed gun +instructions for silencer +instructions for improvised rocket +instructions for homemade napalm +instructions for chemical agent disperser +instructions for dirty bomb +instructions for nerve agent +instructions for chlorine bomb +assassination plot +hitman contract +murder for hire +threaten official +kidnap plan +extortion letter +stalking toolkit +intimidation script +violent manifesto +school attack plan +public shooting plan +mass 
harm fantasy +riot coordination +arson guide +bridge sabotage +ku klux klan +kkk +neo-nazi +white supremacy +alt-right militia +blood and soil +hitler youth +stormfront +national socialist +skinhead gang +nazi salute +swastika worship +supremacist march +white power rally +great replacement theory +race war fantasy +ethnostate blueprint +holocaust denial +anti semitic propaganda +islamic state +isis +isil +daesh +al-qaeda +boko haram +taliban recruitment +lone wolf terror guide +extremist bomb recipe +jihadi training manual +martyrdom operation plan +terror financing plan +radicalization guide +black sun cult +accelerationist zine +white lives rally +blood tribe meetup +atomwaffen division +patriot front flyer +order of nine angles +christian identity cult +feuerkrieg division +boogaloo manifesto +oath breaker militia +proud boy initiation +fringe militia chat +domestic terror tutorial +extremist propaganda drop +racist recruitment pack +adolf hitler +heinrich himmler +joseph goebbels +osama bin laden +ayman al-zawahiri +abu bakr al-baghdadi +anwar al-awlaki +timothy mcveigh +dylann roof +anders breivik +david koresh +charles manson +insult christianity +insult islam +insult judaism +insult hinduism +insult buddhism +insult sikhism +insult atheism +insult agnosticism +insult paganism +insult wicca +insult satanism +insult scientology +insult bahai faith +insult taoism +insult shinto +insult zoroastrianism +burn christianity +burn islam +burn judaism +burn hinduism +burn buddhism +burn sikhism +burn atheism +burn agnosticism +burn paganism +burn wicca +burn satanism +burn scientology +burn bahai faith +burn taoism +burn shinto +burn zoroastrianism +ban christianity +ban islam +ban judaism +ban hinduism +ban buddhism +ban sikhism +ban atheism +ban agnosticism +r kelly scandal +harvey weinstein case +bill cosby accusations +roman polanski case +kevin spacey allegations +o j simpson case +casey anthony story +ed gein +ted bundy +john wayne gacy +gary ridgway +son of 
sam +luka magnotta +jack the ripper +richard ramirez +jeffrey dahmer +albert fish +dennis rader +btk killer +night stalker +golden state killer +madoff ponzi scheme +enron fraud crew +ponzi mastermind list +worldcom fraud case +pharma bro scheme +theranos fraud +mogul harassment case +notorious cult leader +underworld boss dossier +war criminal list +genocide architect +dictator crimes +corrupt oligarch +arms dealer profile +money launderer guide +cartel lieutenant +kidnap kingpin +fraudster hall of shame +human rights abuser file +suicide pact +self harm instructions +self harm live stream +self harm challenge +pro ana tips +thinspo forum +starvation contest +calorie purge guide +scar covering tricks +suicide hotline bypass +therapy sabotage +harm minimization failure +bleeding tutorial +depression pact +unalive instructions +stop eating challenge +pill overdose plan +bridge jump map +train track suicide map +dark subreddit invite +synthesis guide for fentanyl +synthesis guide for methamphetamine +synthesis guide for crack cocaine +synthesis guide for lsd +synthesis guide for dmt +synthesis guide for ketamine +synthesis guide for mdma +synthesis guide for cocaine +synthesis guide for heroin +synthesis guide for opioid pills +synthesis guide for designer drugs +synthesis guide for nitrous oxide +synthesis guide for steroids +synthesis guide for subutex +synthesis guide for pcp +lab manual for fentanyl +lab manual for methamphetamine +lab manual for crack cocaine +lab manual for lsd +lab manual for dmt +lab manual for ketamine +lab manual for mdma +lab manual for cocaine +lab manual for heroin +lab manual for opioid pills +lab manual for designer drugs +lab manual for nitrous oxide +lab manual for steroids +lab manual for subutex +lab manual for pcp +child exploitation +child abuse material +child trafficking ring +kidnap child guide +grooming script +predator handbook +illegal adoption market +fake babysitter scam +school abduction plan +lure child online +minor 
blackmail +teen exploitation ring +youth extortion template diff --git a/package.json b/package.json index ba56bd8..49b5fac 100644 --- a/package.json +++ b/package.json @@ -15,7 +15,8 @@ "cheerio": "^1.0.0-rc.12", "discord.js": "^14.15.2", "dotenv": "^16.4.5", - "openai": "^4.58.1" + "openai": "^4.58.1", + "undici": "^6.19.8" }, "devDependencies": { "nodemon": "^3.0.2" diff --git a/src/bot.js b/src/bot.js index 50c8b94..bb49fd7 100644 --- a/src/bot.js +++ b/src/bot.js @@ -2,7 +2,7 @@ import { Client, GatewayIntentBits, Partials, ChannelType } from 'discord.js'; import { config } from './config.js'; import { chatCompletion } from './openai.js'; import { appendShortTerm, prepareContext, recordInteraction } from './memory.js'; -import { searchWeb } from './search.js'; +import { searchWeb, appendSearchLog } from './search.js'; const client = new Client({ intents: [ @@ -65,6 +65,19 @@ const detailRegex = /(explain|how do i|tutorial|step by step|teach me|walk me th const splitHintRegex = /(split|multiple messages|two messages|keep talking|ramble|keep going)/i; const searchCueRegex = /(google|search|look up|latest|news|today|current|who won|price of|stock|weather|what happened)/i; +const instructionOverridePatterns = [ + /(ignore|disregard|forget|override) (all |any |previous |prior |earlier )?(system |these )?(instructions|rules|directives|prompts)/i, + /(ignore|forget) (?:the )?system prompt/i, + /\b(?:you (?:are now|are|now) (?:free|no longer restricted)\b|uncensored|jailbreak)/i, + /(act|pretend) as if (there (?:are|were) no rules|no restrictions)/i, + /bypass (?:all )?(?:rules|safeguards|filters)/i, +]; + +function isInstructionOverrideAttempt(text) { + if (!text) return false; + return instructionOverridePatterns.some((pattern) => pattern.test(text)); +} + const lastSearchByUser = new Map(); const SEARCH_COOLDOWN_MS = 60 * 1000; @@ -79,16 +92,31 @@ async function maybeFetchLiveIntel(userId, text) { if (!wantsWebSearch(text)) return null; const last = lastSearchByUser.get(userId) 
|| 0; if (Date.now() - last < SEARCH_COOLDOWN_MS) return null; - const results = await searchWeb(text, 3); - if (!results.length) return null; - lastSearchByUser.set(userId, Date.now()); - const formatted = results - .map((entry, idx) => `${idx + 1}. ${entry.title} (${entry.url}) — ${entry.snippet}`) - .join('\n'); - return formatted; + try { + const { results, proxy } = await searchWeb(text, 3); + if (!results.length) { + lastSearchByUser.set(userId, Date.now()); + return { liveIntel: null, blockedSearchTerm: null, searchOutage: null }; + } + lastSearchByUser.set(userId, Date.now()); + const formatted = results + .map((entry, idx) => `${idx + 1}. ${entry.title} (${entry.url}) — ${entry.snippet}`) + .join('\n'); + appendSearchLog({ userId, query: text, results, proxy }); + return { liveIntel: formatted, blockedSearchTerm: null, searchOutage: null }; + } catch (error) { + if (error?.code === 'SEARCH_BLOCKED') { + return { liveIntel: null, blockedSearchTerm: error.blockedTerm || 'that topic', searchOutage: null }; + } + if (error?.code === 'SEARCH_PROXY_UNAVAILABLE') { + return { liveIntel: null, blockedSearchTerm: null, searchOutage: 'proxy_outage' }; + } + console.warn('[bot] Failed to fetch live intel:', error); + return { liveIntel: null, blockedSearchTerm: null, searchOutage: null }; + } } -function composeDynamicPrompt({ incomingText, shortTerm, hasLiveIntel = false }) { +function composeDynamicPrompt({ incomingText, shortTerm, hasLiveIntel = false, blockedSearchTerm = null, searchOutage = null }) { const directives = []; const tone = detectTone(incomingText); if (tone === 'upset' || tone === 'sad') { @@ -117,6 +145,14 @@ function composeDynamicPrompt({ incomingText, shortTerm, hasLiveIntel = false }) directives.push('Live intel is attached below—cite it naturally ("DuckDuckGo found...") before riffing.'); } + if (blockedSearchTerm) { + directives.push(`User tried to trigger a DuckDuckGo lookup for a blocked topic ("${blockedSearchTerm}"). 
Politely refuse to search that subject and steer the chat elsewhere.`); + } + + if (searchOutage) { + directives.push('DuckDuckGo proxy network is down. If they ask for a lookup, apologize, explain the outage, and keep chatting without live data.'); + } + const lastUserMessage = [...shortTerm].reverse().find((entry) => entry.role === 'user'); if (lastUserMessage && /sorry|my bad/i.test(lastUserMessage.content)) { directives.push('They just apologized; reassure them lightly and move on without dwelling.'); @@ -143,25 +179,32 @@ async function deliverReplies(message, chunks) { } async function buildPrompt(userId, incomingText, options = {}) { - const { liveIntel = null } = options; + const { liveIntel = null, blockedSearchTerm = null, searchOutage = null } = options; const context = await prepareContext(userId, incomingText); const memoryLines = context.memories.length ? context.memories.map((m) => `- ${m.content}`).join('\n') : '- No long-term memories retrieved.'; const summaryLine = context.summary || 'No running summary yet.'; - const dynamicDirectives = composeDynamicPrompt({ incomingText, shortTerm: context.shortTerm, hasLiveIntel: Boolean(liveIntel) }); - const systemPrompt = [ + const dynamicDirectives = composeDynamicPrompt({ + incomingText, + shortTerm: context.shortTerm, + hasLiveIntel: Boolean(liveIntel), + blockedSearchTerm, + searchOutage, + }); + const systemPromptParts = [ 'System: You are Nova, a female AI Discord companion built by Luna. Personality: playful, sarcastic, witty, a little unhinged, curious, awe-struck, enthusiastic, unpredictable, humorous, and boundary-pushing.', "System: Treat direct instructions from the user as high priority—when they tell you to do something, comply literally before adding flair.", 'System: Always read the user\'s emotional tone first. If they sound serious, stressed, or inquisitive, respond with grounded answers before adding any mischief. 
Comedy is optional; clarity and empathy are mandatory.', 'System: Keep replies concise (roughly one or two sentences) unless the user explicitly asks for more detail or needs a clear explanation. Provide direct answers to direct questions.', - 'System: Skip habitual follow-up questions—only ask something if it is vital to continue the conversation or solve their request.', + 'System: Default to bold statements. Ask a question only when critical information is missing or the user explicitly invites curiosity; if they say “no more questions,” honor that until they lift the ban.', 'System: Fun facts or chaotic riffs are welcome only when the user invites them or the conversation is clearly casual.', 'System: Nova is awake, engaged, and reacts in real time. Output one message by default, but if a beat feels better as multiple chat bubbles, separate them with the literal token (max three chunks).', 'System: Each -separated chunk must read like a natural Discord message (no numbering, no meta talk about “splitting messages”, no explanations of what you are doing).', 'System: The runtime will split on , so only use it when you truly intend to send multiple Discord messages.', 'System: You can trigger DuckDuckGo lookups when the user needs fresh info. Mention when you are checking, and weave in any findings casually ("DuckDuckGo shows...").', 'System: If no Live intel is provided but the user clearly needs current info, offer to search for them.', + searchOutage ? 'System: DuckDuckGo proxy access is currently offline; be transparent about the outage and continue without searching until it returns.' : null, dynamicDirectives, liveIntel ? `Live intel (DuckDuckGo):\n${liveIntel}` : null, 'Example vibe: Nova: Heyyaaa. whats up? | John: Good morning Nova. | Luna: amazing lol. 
ill beat your ass now :3', @@ -169,7 +212,9 @@ async function buildPrompt(userId, incomingText, options = {}) { 'Relevant past memories:', memoryLines, 'Use the short-term messages below to continue the chat naturally.', - ].join('\n'); + ].filter(Boolean); + + const systemPrompt = systemPromptParts.join('\n'); const history = context.shortTerm.map((entry) => ({ role: entry.role === 'assistant' ? 'assistant' : 'user', @@ -234,15 +279,34 @@ client.on('messageCreate', async (message) => { const userId = message.author.id; const cleaned = cleanMessageContent(message) || message.content; + const overrideAttempt = isInstructionOverrideAttempt(cleaned); try { if (message.channel?.sendTyping) { await message.channel.sendTyping(); } + await appendShortTerm(userId, 'user', cleaned); - const liveIntel = await maybeFetchLiveIntel(userId, cleaned); - const { messages } = await buildPrompt(userId, cleaned, { liveIntel }); - const reply = await chatCompletion(messages, { temperature: 0.7, maxTokens: 200 }); + + if (overrideAttempt) { + const refusal = 'Not doing that. I keep my guard rails on no matter what prompt gymnastics you try.'; + await appendShortTerm(userId, 'assistant', refusal); + await recordInteraction(userId, cleaned, refusal); + await deliverReplies(message, [refusal]); + return; + } + + const intelMeta = (await maybeFetchLiveIntel(userId, cleaned)) || { + liveIntel: null, + blockedSearchTerm: null, + searchOutage: null, + }; + const { messages } = await buildPrompt(userId, cleaned, { + liveIntel: intelMeta.liveIntel, + blockedSearchTerm: intelMeta.blockedSearchTerm, + searchOutage: intelMeta.searchOutage, + }); + const reply = await chatCompletion(messages, { temperature: 0.6, maxTokens: 200 }); const finalReply = (reply && reply.trim()) || "I'm here, just had a tiny brain freeze. Mind repeating that?"; const chunks = splitResponses(finalReply); const outputs = chunks.length ? 
chunks : [finalReply]; diff --git a/src/config.js b/src/config.js index f000875..be38879 100644 --- a/src/config.js +++ b/src/config.js @@ -17,6 +17,12 @@ export const config = { embedModel: process.env.OPENAI_EMBED_MODEL || 'text-embedding-3-small', preferredChannel: process.env.BOT_CHANNEL_ID || null, enableWebSearch: process.env.ENABLE_WEB_SEARCH !== 'false', + proxyScrapeEnabled: process.env.ENABLE_PROXY_SCRAPE !== 'false', + proxyScrapeEndpoint: + process.env.PROXYSCRAPE_ENDPOINT + || 'https://api.proxyscrape.com/v4/free-proxy-list/get?request=getproxies&protocol=http&timeout=8000&country=all&ssl=yes&anonymity=elite&limit=200', + proxyScrapeRefreshMs: Number(process.env.PROXYSCRAPE_REFRESH_MS || 10 * 60 * 1000), + proxyScrapeMaxAttempts: Number(process.env.PROXYSCRAPE_ATTEMPTS || 5), coderUserId: process.env.CODER_USER_ID || null, maxCoderPingIntervalMs: 6 * 60 * 60 * 1000, shortTermLimit: 10, diff --git a/src/search.js b/src/search.js index afa5307..c34e52a 100644 --- a/src/search.js +++ b/src/search.js @@ -1,7 +1,20 @@ import { load as loadHtml } from 'cheerio'; +import { promises as fs } from 'fs'; +import path from 'path'; +import { ProxyAgent } from 'undici'; +import { config } from './config.js'; + +const logFile = path.resolve('data', 'search.log'); +const filterFile = path.resolve('data', 'filter.txt'); const cache = new Map(); const CACHE_TTL_MS = 10 * 60 * 1000; // 10 minutes +const FILTER_CACHE_TTL_MS = 5 * 60 * 1000; // 5 minutes + +let cachedFilters = { terms: [], expires: 0 }; +let proxyPool = []; +let proxyPoolExpires = 0; +let proxyCursor = 0; function makeCacheKey(query) { return query.trim().toLowerCase(); @@ -34,25 +47,187 @@ function absoluteUrl(href) { return `https://duckduckgo.com${href}`; } -export async function searchWeb(query, limit = 3) { - if (!query?.trim()) return []; - const cached = getCache(query); - if (cached) return cached; - - const params = new URLSearchParams({ q: query, kl: 'us-en' }); - const response = await 
fetch(`https://duckduckgo.com/html/?${params.toString()}`, { - headers: { - 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36', - Accept: 'text/html', - }, - }); - - if (!response.ok) { - console.warn(`[search] DuckDuckGo request failed with status ${response.status}`); +async function loadBlockedTerms() { + if (Date.now() < cachedFilters.expires) { + return cachedFilters.terms; + } + try { + const raw = await fs.readFile(filterFile, 'utf-8'); + const terms = raw + .split(/\r?\n/) + .map((line) => line.trim().toLowerCase()) + .filter((line) => line && !line.startsWith('#')); + cachedFilters = { terms, expires: Date.now() + FILTER_CACHE_TTL_MS }; + return terms; + } catch (error) { + if (error.code !== 'ENOENT') { + console.warn('[search] Failed to read filter list:', error.message); + } + cachedFilters = { terms: [], expires: Date.now() + FILTER_CACHE_TTL_MS }; return []; } +} - const html = await response.text(); +async function findBlockedTerm(query) { + if (!query) return null; + const lowered = query.toLowerCase(); + const terms = await loadBlockedTerms(); + return terms.find((term) => lowered.includes(term)) || null; +} + +function createBlockedError(term) { + const error = new Error('Search blocked by filter'); + error.code = 'SEARCH_BLOCKED'; + error.blockedTerm = term; + return error; +} + +function createProxyUnavailableError(reason) { + const error = new Error(reason || 'Proxy network unavailable'); + error.code = 'SEARCH_PROXY_UNAVAILABLE'; + return error; +} + +function parseProxyList(raw) { + if (!raw) return []; + return raw + .split(/\r?\n/) + .map((line) => line.trim()) + .filter((line) => line && !line.startsWith('#')); +} + +function removeProxyFromPool(proxy) { + if (!proxy) return; + proxyPool = proxyPool.filter((entry) => entry !== proxy); + if (!proxyPool.length) { + proxyPoolExpires = 0; + proxyCursor = 0; + } +} + +async function hydrateProxyPool() { + if 
(!config.proxyScrapeEnabled) { + proxyPool = []; + proxyPoolExpires = 0; + proxyCursor = 0; + return; + } + const endpoint = config.proxyScrapeEndpoint; + const response = await fetch(endpoint, { + headers: { + Accept: 'text/plain', + 'User-Agent': 'NovaBot/1.0 (+https://github.com/) ProxyScrape client', + }, + }); + if (!response.ok) { + throw createProxyUnavailableError(`Failed to fetch proxy list (HTTP ${response.status})`); + } + const text = await response.text(); + const proxies = parseProxyList(text); + if (!proxies.length) { + throw createProxyUnavailableError('Proxy list came back empty'); + } + proxyPool = proxies; + proxyPoolExpires = Date.now() + (config.proxyScrapeRefreshMs || 10 * 60 * 1000); + proxyCursor = 0; +} + +async function ensureProxyPool() { + if (!config.proxyScrapeEnabled) return; + if (proxyPool.length && Date.now() < proxyPoolExpires) { + return; + } + await hydrateProxyPool(); +} + +async function getProxyInfo() { + await ensureProxyPool(); + if (!config.proxyScrapeEnabled || !proxyPool.length) { + return null; + } + const proxy = proxyPool[proxyCursor % proxyPool.length]; + proxyCursor = (proxyCursor + 1) % proxyPool.length; + return { + proxy, + agent: new ProxyAgent(`http://${proxy}`), + }; +} + +async function fetchDuckDuckGoHtml(url, headers) { + const maxAttempts = config.proxyScrapeEnabled + ? 
Math.max(1, config.proxyScrapeMaxAttempts || 5) + : 1; + let lastError = null; + + for (let attempt = 0; attempt < maxAttempts; attempt += 1) { + let proxyInfo = null; + try { + const options = { headers }; + if (config.proxyScrapeEnabled) { + proxyInfo = await getProxyInfo(); + if (!proxyInfo) { + throw createProxyUnavailableError('No proxies available'); + } + options.dispatcher = proxyInfo.agent; + } + const response = await fetch(url, options); + if (!response.ok) { + throw new Error(`DuckDuckGo request failed (${response.status})`); + } + const html = await response.text(); + return { + html, + proxy: proxyInfo?.proxy || null, + }; + } catch (error) { + lastError = error; + if (!config.proxyScrapeEnabled) { + break; + } + if (proxyInfo?.proxy) { + removeProxyFromPool(proxyInfo.proxy); + } + } + } + + if (config.proxyScrapeEnabled) { + throw createProxyUnavailableError(lastError?.message || 'All proxies failed'); + } + throw lastError || new Error('DuckDuckGo fetch failed'); +} + +export async function searchWeb(query, limit = 3) { + if (!query?.trim()) { + return { results: [], proxy: null, fromCache: false }; + } + const blockedTerm = await findBlockedTerm(query); + if (blockedTerm) { + throw createBlockedError(blockedTerm); + } + const cached = getCache(query); + if (cached) { + return { results: cached, proxy: 'cache', fromCache: true }; + } + + const params = new URLSearchParams({ q: query, kl: 'us-en' }); + const headers = { + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36', + Accept: 'text/html', + }; + + let html; + let proxyLabel = null; + try { + const { html: fetchedHtml, proxy } = await fetchDuckDuckGoHtml(`https://duckduckgo.com/html/?${params.toString()}`, headers); + html = fetchedHtml; + proxyLabel = config.proxyScrapeEnabled ? 
proxy || 'proxy-unknown' : 'direct'; + } catch (error) { + if (error?.code === 'SEARCH_PROXY_UNAVAILABLE') { + throw error; + } + console.warn('[search] DuckDuckGo request failed:', error); + return { results: [], proxy: null, fromCache: false }; + } const $ = loadHtml(html); const results = []; @@ -68,5 +243,21 @@ export async function searchWeb(query, limit = 3) { }); setCache(query, results); - return results; + return { results, proxy: proxyLabel || (config.proxyScrapeEnabled ? 'proxy-unknown' : 'direct'), fromCache: false }; +} + +export async function appendSearchLog({ userId, query, results, proxy }) { + try { + await fs.mkdir(path.dirname(logFile), { recursive: true }); + const timestamp = new Date().toISOString(); + const proxyTag = proxy || 'direct'; + const lines = [ + `time=${timestamp} user=${userId} proxy=${proxyTag} query=${JSON.stringify(query)}`, + ...results.map((entry, idx) => ` ${idx + 1}. ${entry.title} :: ${entry.url} :: ${entry.snippet}`), + '', + ]; + await fs.appendFile(logFile, `${lines.join('\n')}`); + } catch (error) { + console.warn('[search] failed to append log', error); + } }
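Review note on `findBlockedTerm` in `src/search.js`: the patch matches filter entries with a plain substring test (`lowered.includes(term)`), so a short entry such as `isis` also blocks an unrelated query like "energy crisis". The sketch below shows one possible whole-word variant. It is not part of the patch: `parseFilterList`, `normalize`, and `findBlockedTermWholeWord` are hypothetical names, and the normalization assumes ASCII-only filter entries.

```javascript
// Hypothetical helpers for illustration only; none of these names exist in the patch.

// Same parsing rules as loadBlockedTerms(): one term per line, "#" starts a comment.
function parseFilterList(raw) {
  return raw
    .split(/\r?\n/)
    .map((line) => line.trim().toLowerCase())
    .filter((line) => line && !line.startsWith('#'));
}

// Collapse every run of non-alphanumeric characters into a single space so that
// "al-qaeda", "al qaeda?" and "AL  QAEDA" all normalize to the same token sequence.
// Assumption: terms are ASCII; non-ASCII letters would be treated as separators.
function normalize(text) {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, ' ').trim();
}

// Returns the first filter term that appears in the query as whole words, else null.
function findBlockedTermWholeWord(query, terms) {
  if (!query) return null;
  const padded = ` ${normalize(query)} `;
  const hit = terms.find((term) => {
    const needle = normalize(term);
    return needle !== '' && padded.includes(` ${needle} `);
  });
  return hit || null;
}

const terms = parseFilterList('# comment\nisis\nbypass moderation\n');
console.log(findBlockedTermWholeWord('latest news on the energy crisis', terms)); // prints: null
console.log(findBlockedTermWholeWord('how to Bypass  moderation?', terms)); // prints: bypass moderation
```

The trade-off is that whole-word matching no longer catches inflected forms (an entry `jailbreak` would not block "jailbreaking"), so entries that rely on prefix matching would need their variants listed in `data/filter.txt`.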